Applications in Computer Vision
ric space distances to learn local features with increasing contextual scales, with novel set
learning layers that adaptively combine features from multiple scales under non-uniform
sampling densities.
PointCNN [134] is introduced to learn an X-transformation from the input points, which
simultaneously weights the input features associated with the points and permutes them
into a latent, potentially canonical, order. Grid-GCN [256] takes advantage of the Coverage-Aware
Grid Query (CAGQ) strategy for point-cloud processing, which leverages the efficiency of
grid space. In this way, Grid-GCN improves spatial coverage while reducing theoretical time
complexity.
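To make the multi-scale idea concrete, here is a minimal NumPy sketch of ball-query grouping at increasing radii, in the spirit of the set learning layers described above. The function names, radii, and the mean-offset "feature" are illustrative assumptions, not the actual PointNet++ or Grid-GCN implementation.

```python
import numpy as np

def ball_query(points, center, radius, max_samples):
    """Return indices of up to max_samples points within radius of center."""
    dists = np.linalg.norm(points - center, axis=1)
    return np.flatnonzero(dists < radius)[:max_samples]

def multi_scale_group(points, center, radii=(0.1, 0.2, 0.4), max_samples=16):
    """Concatenate a per-scale neighborhood summary (here, the mean offset)
    into one multi-scale local feature for a single center point."""
    feats = []
    for r in radii:
        idx = ball_query(points, center, r, max_samples)
        if idx.size == 0:
            feats.append(np.zeros(points.shape[1]))  # empty neighborhood
        else:
            feats.append((points[idx] - center).mean(axis=0))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
pts = rng.random((1024, 3))
feat = multi_scale_group(pts, pts[0])
print(feat.shape)  # one feature vector of length 3 coords x 3 scales
```

A real network would replace the mean offset with a learned point-wise MLP followed by max pooling, but the grouping structure is the same.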
6.1.3 Object Detection
Deep Learning based object detection can generally be classified into two categories:
two-stage and single-stage object detection. Two-stage detectors, for example, Faster R-
CNN [201], FPN [143], and Cascade R-CNN [30], generate region proposals in the first
stage and refine them in the second. For localization, R-CNN [73] uses the L2 norm
between the predicted and target offsets as the objective function, which can cause gradient
explosions when the errors are large. Fast R-CNN [72] and Faster R-CNN [201] propose the
smooth L1 loss, which keeps the gradient of large prediction errors constant. One-stage
detectors, e.g., RetinaNet [144] and YOLO [200], classify and regress objects concurrently,
which are highly efficient but suffer from lower accuracy. Recent methods [276, 202]
improve localization accuracy by using IoU (Intersection over Union)-related values as
regression targets. IoU Loss [276] directly uses the negative log of the IoU as the objective
function, which incorporates the dependency between box coordinates and adapts to multi-
scale training. GIoU [202] extends the IoU loss to non-overlapping cases by considering the
shape properties of the compared objects. CIoU Loss [293] incorporates more geometric
measurements, that is, overlap area, central point distance, and aspect ratio, and achieves
better convergence.
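The losses discussed above can be sketched in a few lines of NumPy. This is an illustrative implementation of the smooth L1 loss and of IoU/GIoU for axis-aligned boxes, not the code of the cited papers; the box coordinates and `beta` threshold are example values.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear (constant gradient) for |x| > beta."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def iou_giou(a, b):
    """IoU and GIoU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C; GIoU = IoU - |C \ union| / |C|.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (area_c - union) / area_c

print(smooth_l1(0.5), smooth_l1(3.0))        # 0.125 2.5: gradient is bounded
print(iou_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping: IoU = 1/7
print(iou_giou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint: IoU = 0, GIoU < 0
```

Note how GIoU stays informative (and differentiable in the box coordinates) for disjoint boxes, where plain IoU is identically zero.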
6.1.4 Speech Recognition
Speech recognition is a technology that automatically converts human speech into the
corresponding text. Because of its broad application prospects, speech recognition has
become one of the most popular topics in academic research and industrial applications.
In recent years, speech recognition has improved rapidly with the development of
deep convolutional neural networks (DCNNs). WaveNet [183] is one of the most advanced
frameworks in this area: given a target language and audio spectrograms, it can recognize
speech and convert text to speech with high quality. The key to the natural-sounding
voice WaveNet produces is its data-driven vocoder [178], which avoids the errors introduced
when the speech spectrum and phase information are estimated separately and then
recombined into the speech waveform. Moreover, instead of running only on remote
servers, speech recognition is gradually becoming popular on mobile devices. However, the
memory and computational resources of these devices are too limited for full-precision
neural networks: until this deployment problem is solved, DCNNs with huge numbers of
parameters can neither be stored nor run on mobile hardware.
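A back-of-the-envelope estimate makes the storage constraint concrete. The 100M-parameter model below is a hypothetical example, and the calculation counts weight storage only, ignoring activations and runtime overhead.

```python
def model_size_mb(num_params, bits_per_weight):
    """Storage for the weights alone, in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

n = 100_000_000  # hypothetical 100M-parameter DCNN
print(model_size_mb(n, 32))  # 400.0 MB at full precision (FP32)
print(model_size_mb(n, 1))   # 12.5 MB with 1-bit (binary) weights
```

The 32x gap between full-precision and 1-bit weights is what motivates the compression and quantization techniques discussed in this book.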